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Abstract 



The paper investigates a novel approach, based on Constraint Logic Programming (CLP), 
^ ' to predict the 3D conformation of a protein via fragments assembly. The fragments are 

c7^ , extracted by a preprocessor — also developed for this work — from a database of known 

00 . protein structures that clusters and classifies the fragments according to similarity and 

frequency. The problem of assembling fragments into a complete conformation is mapped 
to a constraint solving problem and solved using CLP. The constraint-based model uses a 
[~~. ' medium discretization degree Ca-side chain centroid protein model that offers efficiency 

c7^ ' and a good approximation for space filling. The approach adapts existing energy models to 

^3 . the protein representation used and appfies a large neighboring search strategy. The results 

shows the feasibility and efficiency of the method. The declarative nature of the solution 
allows to include future extensions, e.g. different size fragments for better accuracy. 



3 , 1 Introduction 

Proteins are central components in the way they control and execute the vital 
functions in living organisms. The functions of a protein are directly related to 
its peculiar three-dimensional conformation. Knowledge of the three-dimensional 
conformation of a protein (also known as the native conformation or tertiary struc- 
ture) is essential to biomedical investigation. The native conformation represents 
the functional protein and determines how it can interact with other proteins and 
affect the functions of the hosting organism. It is impossible to clearly understand 
the behavior and phenotype of an organism without knowledge of the native con- 
formation of the proteins coded in its genome. As a result of advancements in DNA 
sequencing techniques, there is a large and growing number of protein sequences — 
i.e., lists of amino acids, also known as primary structures of proteins — available 
in public databases (e.g., the database Swiss-prot contains about 500,000 protein 
sequences). On the other hand, knowledge of structural information (e.g., infor- 
mation concerning the tertiary structures) is lagging behind, with a much smaller 
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number of structures deposited in public databases, notwithstanding that structural 
genomics initiatives started worldwide. 

For these reasons, one of the most traditional and central problems addressed by 
research in bioinformatics deals with the protein structure prediction problem, i.e., 
the problem of using computational methods to determine the native conforma- 
tion of a protein starting from its primary sequence. Several approaches have been 
explored to address this problem. In broad terms, it is possible to distinguish be- 
tween two main classes of approaches. The more traditional term protein structure 
prediction has been commonly used to describe methods that rely on comparisons 
between known and unknown proteins to predict the end-result of the spontaneous 
protein folding process (also known as homology modeling). The more expensive 
task of predicting a protein structure starting from the knowledge of the chemi- 
cal structure and laws of physics (known as de-novo/ab-initio prediction) has been 
studied with different levels of accuracy and complexity. 

Instead, protein folding simulations tries to understand the folding path leading 
to the native conformation, typically using investigations of the potential energy 
landscape or using molecular dynamics simulations. In both classes of methods, 
knowledge of "patterns" can be used to restrict the search space — and this is par- 
ticularly true for the case of secondary structure components of a protein, i.e., local 
helices or strands. Secondary structure components are important, considering that 
their formation is believed to represent the earliest phase of the folding process, and 
their identification can be relatively simpler (e.g., through low- resolution observa- 
tions of images from electron microscopy). 



1.1 Related work 

As mentioned above, it is possible to predict protein structures based on their se- 
quences, using either homology modeling or fold recognition techniques. Neverthe- 
less, it is in general difficult to predict a protein structure based only on its sequence 
and in absence of structural templates. Explicit solvent molecular dynamics simula- 
tions of protein folding are still beyond current computational capabilities. Already 
in 1968, Levinthal postulated that the systematic exploration of the space of possible 
conformations is infeasible (jLevinthal 1968i) . This complexity has been confirmed 
by theoretical results, showing that even extremely simplified formalizations of the 
problem lead to computationally intractable problems (jCrescenzi et al. 1998p . Re- 
cently, ab-initio methods for generating protein structures given their sequences 
have been proposed and successfully used ()Ben-David et al. 2009| . Key elements of 
these methods are the use of evolutionary information from multiple alignments, 
the adoption of simplified representations of proteins, and the introduction of frag- 
ments assembly techniques. These methods rely on assembling the structure using 
simplified representations of protein fragments with favorable conformations (ob- 
tained from structural databases) for the profile of the given sequence. Three to 
nineteen- residue fragments (i.e., 3-9 for small proteins and 5-19 for large ones) 
contain correlations among neighboring residue conformations, and most of the 
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computation time is spent in finding the global arrangement of fragments that 
already display good local conformations ([Raman et al. 20091 IWu et al. 2007| ). 

Several simplified models have been introduced to address the problem. Simpli- 
fied models abstract several properties of both proteins and space, leading to a 
version of the problem that can be solved more efficiently. The solutions of the sim- 
plified problem constitute candidate configurations that can be refined with more 
computationally intensive methods, e.g., molecular dynamics simulations. Possi- 
ble simplifications include the introduction of fixed sizes of monomers and bond 
lengths, the representation of monomers as simple points, and viewing the three- 
dimensional space as a discretized collection of points. In these simplified models, it 
is possible to view the protein folding problem as an optimization problem, aimed at 
determining conformations that minimize an energy function. The energy function 
must be defined according to the simplified model adopted (jBerrera et al. 20031 
[Fogolari et al. 20071 ) ■ Simplified energy models have been devised specifically to 
solve large instances of the structure prediction problem using constraint solving 
approaches (Backofen and Will 20061 IDotu et al. 2008^ . 

In our own previous efforts, we have applied declarative programming approaches, 
based on constraint solving, to address the problem. We built our efforts on a dis- 
crete crystal lattice organization of the allowable points in the three-dimensional 
space. This representation exploits the property that the distance between the Ca 
atomqj of two consecutive amino acids is relatively constant (3.8A). The problem 
is viewed as placing amino acids in the allowable points, in such a way that con- 
straints encoding the mutual distances of amino acids in the primary sequence are 
satisfied (|Dal Palu et al. 20041 IDal Palu et al. 2007^ . The original framework has 
also been expanded to support several types of global constraints, i.e., constraints 
describing complex relationships among groups of amino acids. One of these con- 
straints is the rigid structure constraint — this constraint enables the representation 
of known substructures (e.g., secondary structure components), reducing the prob- 
lem to the determination of an appropriate placement and rotation of such substruc- 
tures in the lattice space. The ability to use rigid structure constraints has been 
shown to lead to dramatic reductions in the search space (|Dal Palii et al. 20101 



Barahona and Krippahl 2008 1 . However, exploiting knowledge of real rigid sub- 
structures in a discrete lattice is infeasible, due to the errors introduced by the 
approximations imposed by the discretized representation of the search space. On 
the other hand, the use of a less constrained space model leads to search spaces 
that are intractable. These two considerations lead to the new approach presented 
in this work. 



1.2 The contribution of this work 

Some of the most successful approaches to protein folding build on the principles of 
using substructures. The intuition is that, while the complete folding of a protein 
may be unknown, it is likely that all possible substructures, if properly chosen, 

^ A carbon atom that is a convenient representative of the whole amino acid 
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can be found among proteins whose conformations are known. The folding can be 
constructed by exploiting relationships among substructures. A notable example of 
this approach is represented by Rosetta ([Raman et al. 2009| — an ab initio protein 
structure prediction method that uses simulated annealing search to compose a 
conformation, by assembling substructures extracted from a fragment library; the 
library is obtained from observed structures stored in the Protein Data Bank {PDB, 
[www ■ pdb ■ orgp . 

In this work, we follow a similar idea, by developing a database of amino acid 
chains of length 4; these are clustered according to similarity, and their frequencies 
are drawn from the investigation of a relevant section of the Protein Data Bank. The 
database contains the data needed to solve the protein folding problem via frag- 
ments assembly. Declarative programming techniques are used to generate clean 
and compact code, and to enable rapid prototyping. Moreover, the problem of as- 
sembling substructures is efficiently tackled using the constraint solving techniques 
provided by CLP{J^'D) systems. 

This paper has the goal of showing that our approach is feasible. The main ad- 
vantage, w.r.t. a highly engineered and imperative tool, is the modularity of the 
constraint system, which offers a convenient framework to test and integrate statis- 
tical data from various predictors and databases. Moreover, the constrained search 
technique itself represents a novel method, compared to popular predictors, and 
we show its effectiveness in combination with the development of new energy func- 
tions and heuristics. The proposed solution includes a general implementation of 
large neighboring search in CLP {TV), that turned out to be highly effective for 
the problem at hand. Another contribution is the development of a new energy 
function based on two components: a contact potential for backbone and side chain 
centroids interaction, and an energy component for backbone conformational pref- 
erences. Backbone and side chain steric hindrances are imposed as constraints. 



2 Protein Abstraction 

2.1 Preliminary notions 

We focus on proteins described as sequences of amino acids selected from a set A 
of 20 elements (those coded by the human genome). In turn, each amino acid is 
composed of a set of atoms that constitute the amino acid's backbone (see Fig. [T]) 
and a set of atoms that differentiate pairs of amino acids, known as side chain. One 
of the most important structural properties is that two consecutive Ca atoms have 
an average distance of 3.8A. Side chains may contain from 1 to 18 atoms, depending 
on the amino acid. For computational purposes, instead of considering all atoms 
composing the protein, we consider a simplified model in which we are interested 
in the position of the Ca atoms (representing the backbone of the protein) and 
of particular points, known as the centroids of the side chains (Fig. [3|). A natural 
choice for the centroid is the center of mass of the side chain. 

It is important to mention that, once the positions of all the Ca atoms and 
of all the centroids are known, the structure of the protein is already sufficiently 
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determined, i.e., the position of the remaining atoms can be identified almost de- 
terministically with a reasonable accuracy. 
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Fig. 1. Two consecutive amino acids and the clustering of amino acids into 9 classes 

Focusing on the backbone and on the Ca atoms, three consecutive amino acids 
define a bend angle (see 9 in Fig. [2] — left). Consider now four consecutive amino 
acids 01,02,03,04. The angle formed by 712 — (04 — 03) x (03 — 02) and rii — 
(03 — 02) X (02 — oi) is called torsional angle (see (f> in Fig. [2] — right). If these angles 
are known for all the consecutive 4-tuples forming a protein, they uniquely describe 
the 3D positions of all the Ca atoms of the protein. 





Fig. 2. Bend (left) and torsional (right) angles 



Given a spatial conformation of a 4-tuple of consecutive Ca atoms, a small degree 
of freedom for the position of the side chain is allowed — leading to conformers 
commonly referred to as rotamers. To reduce the search space, we do not consider 
such variations. Once the positions of the Ca atoms are known, we deterministically 
add the positions of the centroids. In particular, the centroid of the i-th residue is 
constructed by using the positions of Ca^-i, Cai and Ca^+i as reference and by 
considering the average of the center of mass of the same amino acid type centroids, 
sampled from a non- redundant subset of the PDB. The parameters that uniquely 
determine its position are: the average Ca-side chain center of mass distance, the 
average bend angle formed by the side chain center of mass-Cai-Cofi+i, and the 
torsional angle formed by the Cai^i-Cai-Cai+i-side chain center of mass. Even 
with this simplification, the introduction of the centroids in the model allows us to 
better cope with the layout in the 3D space and to use a richer energy model. In 
Fig. [31 we report an example of this abstraction with a fragment with 10 alanines 
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(ALA). For these amino acids, the centroids coincide with the only heavy atom of 
each sidechain. 

This has been experimentally shown to produce more accurate results, without 
adding extra complexity w.r.t. a model that considers only the positions of the Ca 
atoms and without the use of centroids. 





Fig. 3. A fragment of 10 ALA amino acids in all- atom and Ca-centroid represen- 
tation 



2.2 Clustering 

Although more than 60,000 protein structures are present in the PDB, the complete 
set of known proteins contains too much redundancy (i.e., very similar proteins 
deposited in several variants) to be useful for statistical purposes. Therefore we 
focused on a subset of the PDB called top-500 (jLovell et al. 2003| rlThis set contains 
500 proteins, with 107, 138 occurrences of amino acids. The number of different 
4-tuples occurring in the set is precisely 62,831. Since the number of possible 4- 
tuples of amino acids is jA''! = 20^ = 160,000, this means that most 4-tuples do 
not appear in the selected set; even those that appear, occur too rarely to provide 
significant statistical information. For this reason, we decided to cluster amino 
acids into 9 classes, according to the similarity of the torsional angles of the pseudo 
bond between two consecutive Ca atoms ( [Fogolari et al. 2007 1 . Note that, in our 



simplified model, two consecutive Cq atoms do not form a covalent bond and this 
fact is indicated by the term pseudo. 

Let 7 : A — > {0, . . . , 8} be the function assigning a class to each amino acid, as 
defined in Fig.[T] for i G {0, . . . , 8}, let us denote with 7~^(i) = {a € A : 7(a) = i}. 
In this way, the majority of the 9* = 6, 561 4-tuples have a representative in the 
set (precisely, there are templates for 5,830 of them). 

A second level of approximation is introduced in deciding when two occurrences 
of the same 4-tuple have the "same" form. The two tuples are placed in the same 
class when their _Root Mean Square Deviation (RMSD) is less than or equal to a 
given threshold (rmsd_thr); this threshold is currently set to l.OA. We developed a 
C program, called tuple_generator, that creates a set of facts of the form: 

tuple( [51, 52, 53, 54], [Xf , Ff , Zf , X^, Y^, Z^ , Xf , Y^ , Z^ , X^ , Y^, Z^\ 
(72~centroid-list, 53-centroid-list, FREQ, ID, PID) 

^ Note that our program tuple_generator, developed to extract the desired information, can work 
on any given set of known proteins. 
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where [gi, g2, gs, g^] S {0, . . . , 8}^ identifies the class of each amino acid, X", . . . , Z" 
are the coordinates of the 4 Ca atoms of the 4-tuple|3 FREQ G {0, . . . , 1000} is a 
frequency factor of the template w.r.t. all occurrences of the 4-tuple gi, . . . , g^ in 
the set top-500, ID is a unique identifier for this fact, and PID is the first protein 
found containing this template; this last piece of information will be printed in the 
file we produce as output of the computation, in order to allow one to recover the 
source of a fragment used for the prediction. 

As discussed in Sect. 12.11 we model the position of the centroid of the side chain 
of every amino acid, gi G {0, . . . , 8} is a representative of the class of amino acids 
l~^ig^) (e.g., 7~^(2) = {ARG, GLU, GLN, LYS}~see Fig.H]). For each amino acid 
a G J^^igi), we compute the position of the centroid corresponding to the positions 
X", . . . , Z" of the Ca atoms, and add it to the ^i-centroid-list. Let us observe 
that we do not add the position of the first and last centroid in the 4-tuples. As 
a result, at the end of the computation, only the centroid of the first and the last 
amino acids of the entire protein will be not set; these can be assigned using a 
straightforward post-processing step. 

It is unlikely that a 4-tuple oi , . . . , 04 that docs not appear in the considered 
training set will occur in a real protein. Nevertheless, in order to handle these cases, 
if [7(01), . . . ,7(04)] has no statistics associated to it, we map it to the special 4- 
tuple [—1, —1, —1, —1]. By default, we assign to this unknown tuple the set of 6 most 
common templates (the number can be easily increased) among the set of all known 
templates. Other special 4-tuples are [—2, —2, —2, —2] and [—3, —3, —3, —3]; these 
are assigned to a-helices and /3-sheets every time a secondary structure constraint 
is locally enforced on a region of the conformation. Handling of these special tuples 
will be described in detail in Section 13.11 

We also introduce an additional collection of facts, based on the predicate next, 
which are used to relate pairs of tuple facts. The relation next(lDi , ID2, Mat) holds 
if the last three amino acids of the sequence in the tuple fact identified by IDi are 
identical to the first three amino acids of the sequence in the tuple fact ID2, and 
the corresponding Ca sequences are almost identical — in the sense that the RMSD 
between them is at most rmsd_thr. Mat is the rotation matrix to align the 1-3 Ca 
atoms of the 4-tuple of ID2 with the 2-4 Ca atoms of the 4-tuple of IDi. 

2.3 Statistical energy 

The energy function used in this work builds on two components: (1) a contact 
potential for side chain and backbone contacts and (2) an energy component for each 
backbone conformation based on backbone conformational preferences observed in 
the database. The first component uses the table of contact energies described 
in (jBerrera et al. 20^3|) . This set of energies has been shown to be accurate, and 
it is the result of filtering the dataset to allow accurate statistical analysis, and a 
consequent fine tuning of the contact definition. Since not all side chain atoms are 
represented, it is not possible to directly use the definition of (jBerrera et al. 20"03l) : 

3 Without loss of generality, we set (Xf , Y{' , Zf ) = (0, 0, 0). 
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therefore, we retain the table of contact energies of ([Berrera et al. 2003]) . but we 
adapt the contact definition to the side chain centroid. In particular, we use the 
value 3 • 2A as the shortest distance between two centroids, and we set the contact 
distance between two centroids to be less than or equal to the sum of their radii. 
Larger distances provide a contribution that decays quadratically with the distance. 
The radius of each centroid is computed as the centroid's average distance from 
the Ca atom. No contact potential is assigned to the side chain's Ca, as the side 
chain definition already includes the Ca carbon. The energy assigned to the con- 
tact between backbone atoms represented by Ca (not included in the side chain 
definition) is the same energy assigned to ASN-ASN contacts, which involve mainly 
similar chemical groups contacts. The torsional angle defined by four consecutive 
Ca atoms is assigned an energy value defined by the potential of the mean force 
derived by the distribution of the corresponding torsional angle in the PDB. The 
procedure has been thoroughly described in ( [Fogolari et al. 2007[ ). 

3 Modeling 

In this section, we describe the modeling of the problem of fragments assembly 
using constraints over finite domains — specifically, the type of constraints provided 
in CLP{T'D). The input is a list Primary of n amino acids. We will denote with Ui 
the i*^ element of the primary sequence. We also allow PDB identifiers as inputs; 
in this case, the primary structure of the protein is retrieved from the PDB. A list 
of n — 3 variables (Code) is generated. The i-th variable Ci of Code corresponds to 
the 4-tuple (7(04), . . . ,7(0^+1), 7(0^+2), 7(02+3))- The possible values for Ci are the 
IDs of the facts of the form: 

tuple([7(aj), 7(a,+i), 7(0^+2), 7(0^+3)], _, _, Freq, ID, _)• 

This set is ordered using the frequency information Freq in decreasing order, and 
stored in a variable ListDonii. 

The next information is used to impose constraints between Ci and C^+i. Using 
the combinatorial constraint table, we allow only pairs of consecutive values sup- 
ported by the next predicate. Recall that, for each allowed combination of values, 
the next predicate returns the rotation matrix Mi.i+i, which provides the relative 
rotation when the two fragments are best fit. 

Another list with 3n variables (Tertiary) is also generated: X", Y", Zf (resp., 
Xf, Yf , Zf) denoting the 3D position of the Ca atoms (resp., of the centroids). 
These variables have integer values (representing a precision of 10~^A). 

In order to correlate Code variables and Tertiary variables, consecutive 4-tuples 
must be constrained. Let us focus on the Ca part; consider two consecutive tuples: 

• ti = ai, fli+i, ai+2, ^1+3 with code variable Ci, and 

• ti+i = tti^i, ai^2, 0,1+3, o-i+i, with code variable C^+i and local coordinates 

Vi, V2, V3, n. 

ii+i is rotated as to best overlap the points in common with ti, and it is placed so 
that the last point in i^+i is at 3.8A from the last point in ti. 

Let X", Y" , Z" , . . . , X°j_4, Y^^, Z^^^ be the variables for the coordinates of these 
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Fig. 4. Consecutive fragments combination 

Ca atoms, stored in the list Tertiary (Fig. El where P, = (Xf , Y°',Z°')). The 
constraint introduced rotates and translates the template t^+i from the reference 
of Ci (represented by the orthonormal basis matrix Ri) according to the rotation 
matrix Mi.i+i to the new reference Ri+i = Ri x Mi^i+i. Moreover, when placing the 
template ti+i, the constraint affects only the coordinates of 0^+4, since the other 
variables are assigned by the application of the same constraint for templates tj , 
j < i + 1. The constraint shifts the rotated version of f^+i so that it overlaps 
the third point V3 with (X.'^g, Yi'^3,^^3). Formally, let V^ = Ri+i x V,,, with 
k G {1 . . .4}, be the rotated 4-tuple corresponding to C^+i. The shift vector s = 
(X°^3, Y°^^, Z"_^^) — V^ is used to constrain the position of 0^+4 as follows: 

(^i+4i ^j+4j ^H-4) = S + Ri + 1 X ^4 

Note that the 3.8A distance between consecutive amino acids (i.e., a^+a and 0^+4) 
is preserved, and this constraint allows us to place templates without requiring an 
expensive RMSD fit among overlapping fragments during the search. Moreover, 
during a leftmost search, as soon as the variable Ci is assigned, also the coordinates 
(Xi^3, Yi+s, ^i+3) are uniquely determined. 

Matrix and vector products are handled by FD variables and constraints, using 
a factor of 1000 for storing and handling rotation matrices. 

For the sake of simplicity, we omit the formal description of the constraints asso- 
ciated to the centroids. The centroids' positions are rotated and shifted accordingly, 
as soon as the corresponding positions of the Ca atoms are determined. 

The Xf , Yf, Zf , . . . , X^, Y^, Z^ part of the Tertiary hst relative to the posi- 
tion of the Ca atoms, is also required to satisfy a constraint which guarantees the 
all_distant property (jDal Palii et al. 2010|) : the Ca atoms of each pair of non- 
consecutive amino acids must be distant at least D = 3 ■ 2A. This is expressed by 
the constraint: 

(Xf - x/)2 + ( Y," - y/)2 + (Zf - Z,")2 > D^ 

for all i G {l,...,n — 2} and j G {i + 2,...,n}. Similar constraints are imposed 
between pairs of Ca and centroids as well as pairs of centroids. In the latter case, in 
order to account for the differences in volume of each possible side chain, we deter- 
mine minimal distances that depend on the specific type of amino acid considered. 
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Another constraint is added to guide the search: a diameter parameter is used 
to bound the maximum distance between different Ca atoms (i.e., the diameter of 
the protein). As we argued in earher work (|Dal Palia et ah 2004p . a good diameter 
value is 5 • 68 'nP'^^ A. We impose this constraint to all different pairs of Ca atoms. 



3.1 Secondary information 

The native structure of a protein is largely composed of some recurrent local struc- 
tures. These structures, a-helices and /3-sheets, can be predicted with good accuracy 
using efficient techniques, such as neural networks, or recognized using other tech- 
niques (e.g., analysis of density maps from electron microscopy). Being based on 
frequency analysis, our tool is able to discover the majority of secondary structures. 
On the other hand, a-priori knowledge of these structures allows us to remove sev- 
eral non-deterministic choices during computation. Therefore, we allow knowledge 
of secondary structures to be provided as part of the input — e.g., information in- 
dicating that the amino acids i-j form an a-helix. In the processing stage, for 
fc G {?,... ,j — 3}, a particular tuple [—2, —2, —2, —2] is assigned instead of the tu- 
ple [7(afc), . . . , 7(afc+3)]. This tuple has a unique template which is associated to an 
a-helix, built in a standard way using a bend angle of 93.8 degrees and a torsional 
angle of 52.3 degrees. Moreover, a list of the possible positions for the centroids of 
the 20 amino acids is retrieved. Since the domains for these Cfc's are singletons, 
as soon as Ci is considered for value assignment, all the points of the helix are 
deterministically computed. The situation in the case of /3-strands is analogous. 

A variation of this technique can be used if larger and more complex substructures 
are known. Basically, even keeping just 4-tuples as internal data structures, we can 
easily deal with tuples of arbitrary lengths. Automated manipulation of arbitrary 
complex structures is subject of future work. 

4 Searching 

The search is guided by the instantiation of the Ci variables. These variables are in- 
stantiated in leftmost-first order; in turn, the values in their domains are attempted 
starting with the most probable value first. We observed that a first-fail strategy 
does not speed up the search, probably due to the weak propagation of the matrix 
product constraints. As described in Section 12.31 we associate an energy to each 
computed structure. The energy value is an FD constraint that links coordinates 
variables and amino acids. Given the model of the problem, this kind of constraint is 
not able to provide effective bounds for pruning the search space when searching for 
optimal solutions. As future work, we plan to investigate specific propagators, since 
the torsional energy contribution could be exploited for exact bounds estimations. 

Each computed structure is saved in pdb format. This is a standard format for 
proteins (detailed in the PDB repository) that can be processed by most protein 
viewers (e.g., Rasmol, ViewerLite, JMol). 

In order to further reduce the time to search for solutions, we have developed a 
logic programming implementation of Large Neighboring Search (LNS) (jShaw 1998[) . 
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LNS is a form of local search, where the search for the successive solutions is per- 
formed by exploring a "large" neighborhood. The traditional basic move used in 
local search, where the values of a small number of variables are changed, is re- 
placed by a more general move, where a large number of variables is allowed to 
change, and these variables are subject to constraints. The basic LNS routine is the 
following: 

1. Generate an initial solution (e.g., using the standard CLP labeling procedure). 

2. Randomly select a subset of the variables of the problem that are admissible 
for changes, and assign the previous values to the other variables. 

3. Using standard labeling, look for an assignment of the selected variables that 
improve the energy/cost function (or look for an assignment that optimize 
the energy/cost function). In any case, go back to step 2. 

A timeout mechanism is typically adopted to terminate the cycle between steps 2 
and 3. For example, the procedure is terminated if either a general time-out oc- 
curs or k iterations are performed without any improvement in the quality of the 
solution. In these cases, the best solution found is returned. 

This simple procedure can be improved by occasionally allowing a worsening 
move — this is important to enable the procedure to abandon a local minimum. 
The process can be implemented by introducing a random variable; whenever such 
variable is assigned a certain value (with a low probability), then an alternative 
move which worsens the quality of the solution is performed; otherwise we continue 
using the scheme described earlier. 

The scheme has been implemented in CLP^J-D). Even though the implementa- 
tion has been developed to meet the needs of the protein folding problem, the scheme 
can be adapted with minimal changes to address the needs of other constraint op- 
timization problems coded in CLP{J'V). The logical variables of CLP {FT)) ^ being 
single- assignment, are not suitable to the process of modifying a solution as re- 
quested by LNS. Therefore, a specific combination of backtracking, assertions and 
reassignment procedures are enforced, in order to reset only the assignments to the 
CSP (while the constraints are maintained) and to re-assign previous values to the 
variables that are not going to be changed. The remaining variables can assume 
any assignment compatible with the constraints. 

The implementation we developed avoids these repetitions, by using extra-logical 
features of Prolog to record intermediate solutions in the Prolog database (using 
the assert/retract predicates). The loss of declarativity is balanced by enhanced 
performance in the implementation of LNS. 

Fig. [5] summarizes the CLP{F'V) implementation of LNS. The predicate best 
is used to memorize the best solution encountered, while the predicate last_sol 
represents the last solution found. The first clause of Ins represents the core of the 
procedure — by setting up the constraints (constraint) and searching for solutions 
(local); the fail at the end of the clause will cause backtracking into the clauses 
of Ins that determine the final result (lines 7-12). If at least one solution has been 
found, then the final result is extracted from the fact of type best in the database. 

Starting from one solution (stored in the Prolog database in the fact last_sol), 
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the predicate anotlier_sol determines the next solution in the LNS process. If a 
solution has never been found, then a first solution is computed, by performing a 
labeling of the variables (lines 18-19). Otherwise, the values of some of the variables 
are modified as dictated by LNS (line 24) and a new solution is computed. Note that 
an additional constraint on the resulting value of the objective function (represented 
by the variable Energy) is added; with probability ^ (as determined by a random 
variable Type in line 21) a worsening of the energy is requested (line 22), while in 
all other cases an improvement is requested (line 23). If a new solution is found, 
this is recorded in the Prolog database (line 32); if this solution is better than any 
previous solution, then also the best fact is updated (line 36). Observe that an 
internal time-out mechanism (set to 120s — line 25) is applied also on the search 
of a new solution. The "cut" in line 30 is introduced to avoid the computation of 
another solution with the same random choices. 

The iterations of another_sol are performed by the predicate local (lines 13- 
15). The negated call to another_sol is necessary to enable removal of all variable 
assignments (but saving constraints between variables) each time a new cycle is 
completed. The local procedure cycles indefinitely. 

A final note on the procedure largejnove. We implemented two types of LNS 
moves. The first (large_pivot) makes a set of consecutive Code variables unbound, 
allowing the procedure to change a (large) bend. The other Code variables and the 
first segment of Tertiary variables remain assigned. The second instead leaves 
unbound two independent sets of consecutive variables, thus allowing a rotation of 
a central part of the protein (a sort of large_crankshaf t move). We use both types 
of moves during the search (alternated using a random selection process). 

5 Experimental results 

The prototype can search the first admissible solution (pf_id(lD .Tertiary) ), where 
ID is a protein name included in the database (prot_list.pl); the Primary se- 
quence is also admitted directly as input. It is possible to generate the first N solu- 
tions and output them as distinct models in a single pdb file, or to create different 
files for all the solutions found within a Timeout. Finally, LNS can be activated by 
pf_id_lns(ID, Timeout) . A version of the current CLP{J-'D) implementation, along 
with a set of experimental tests, is available at www . d imi . uniud . it/dovier/PF , 

The experimental tests have been performed on an AMD Opteron 2.2GHz Linux 
Machine. Each computation was performed on a single processor using SICStus 
Prolog 4.0.4. Some of the experimental results are reported in Table [TJ The exe- 
cution times reported correspond to the time needed to compute the best solution 
computed within the allowed execution time (s stands for seconds, m for minutes). 

We consider 8 proteins and perform an exhaustive search of 6.6 hours. For other 
4 proteins, we perform both enumeration (left) and LNS search for 2 days (right). 
Moreover we launch LNS for 2 hours (center) . 

For every protein, we impose the secondary structure information as specified in 
the corresponding PDB annotations. However, we wish to point out that the 12 
proteins tested are not included in the top-500 Database from which we extracted 



10 

11 

12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 



CLP-based protein fragment assembly 721 

lns(ID, Time) :- 

constraint (ID , Code , Energy , Primary .Tertiary) , 
retractalKbest (_,_,_)) , assert (best (0, Primary, _)) , 
retract all (last _sol (_,_,_)) , assert (last _sol(0, null, _)) , 
time_out(local (Code, Energy, Primary, Tertiary) , Time,_) , 
fail. 
lns(_,_) :- 

best(0,_,_,_), ! , 

writeClnsuf f icient time for the first solution') ,nl. 
lns(.,_) :- 

best (Energy, Primary, Tertiary) , 
print_results (Primary , Tert iary , Energy) . 
local (Code, Energy, Primary , Tertiary) :- 

\+ another_sol (Code, Energy, Primary, Tertiary) , 
local (Code, Energy, Primary , Tertiary) . 
another_sol(Code, Energy, Primary, Tertiary) :- 
last_sol(Lim, LastSol,LastTer) , 
( LastSol = null -> 

labeling (Code, Tertiary) ; 
LastSol \= null -> 

randomd , 11 ,Type) , 

(Type =< 1 -> Liml is 5*Lim//6, Energy #> Liml ; 
Type > 1 -> Energy #< Lim ) , 
large_move (Code , LastSol , Tertiary , LastTer) , 
time_out( labeling(Code, Tertiary) , 120000, Flag)), 
(Flag == success -> true ; 
Flag == time_out -> 

write ('Warning: Time out in the labeling stage'), nl, 
fail)), 
! _ 

retract(last_sol(_,_,_)) , 

assert (last_sol (Energy , Code , Tertiary) ) , 

best(Val,_,_), 

(Val > Energy -> 

retract (best (_,_,_)), 

assert (best (Energy , Primary , Tertiary) ) ; 
true) , 
fail. 



Fig. 5. An implementation of LNS in Prolog 



the 4-tuples. For each protein analyzed using LNS, we performed 4 independent 
runs, anticipated by a randomize statement, since the process relies on random 
choices. We report the best result (in terms of RMSD) and the energy associated. 

The enumeration search is expected to perform better for smaller proteins, where 
it is possible to explore a large fraction of the search space within the set time 
limit. It is interesting to note that, for smaller chains, the RMSD w.r.t. the native 
conformation in the PDB is rather small (ca. 3A); this indicates that the best 
solutions found capture the fold of the chain, and the determined solutions can be 
refined using molecular dynamics simulations, as done in ()Dal Palia et al. 2004p . 

Moreover, the proteins IZDD, IVII, and lAIL have been simulated both with 
enumeration and LNS search strategy, in order to show that the latter is able to 
improve both quality (i.e., RMSD) and computational time (up to 200 times faster), 
thanks to the neighborhood exploration. In the case of IZDD and lAIL, where there 
are three helices packed together, the random choice of good pivot moves effectively 
guides the folding towards the optimal solution. 
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PID 
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merate 6.6 hours 






RMSD 


Energy T (s) 


IKVG 


12 


2.79 


-59122 9.88 


lEDP 


17 


3.04 


-112755 73.00 


ILEO 


12 


3.12 


-45575 3.20 
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40 


5.96 


-266819 5794.88 
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1LE3 


16 


3.90 


-69017 


218.79 


lENH 


54 


5.10 


-467014 


8553.92 


IPGl 


18 


3.22 


-109456 


11.00 


2K9D 


44 


6.99 


-460877 


1453.44 



PID 


N 
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LNS 2 hours 






LNS 2 days 
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Energy T (m) 


RMSD 


Energy 


T(m) 


RMSD 


Energy 


T(m) 


IZDD 


34 


4.12 


-231469 1290 


3.81 


-226387 


6 


3.81 


-226387 


6 


lAIL 


69 


9.78 


-711302 301 


5.53 


-665161 


20 


5.44 


-668415 


109 


IVII 


36 


7.06 


-263496 1086 


6.64 


-252231 


48 


5.52 


-271178 


170 


2IGD 


60 


16.35 


-375906 2750 


10.91 


-447513 


27 


8.67 


-467004 


380 



Table 1. Computational results 

Finally, the protein 2IGD shows that the allowed time was not sufficient to de- 
termine a sensible prediction, even though the LNS shows enhancements in quality 
of the prediction, given the same amount of time. In this case, additional work is 
needed to improve the choice of moves performed by LNS, in order to explore the 
neighborhood in a more effective way. For example, moves that shift and translate 
some rigid parts of the protein may improve the performance and quality. 

We also note that the size of the set of fragments in use is sufficient to allow 
reasonable solutions, while allowing tractable search spaces. Clearly, the use of 
a more refined partitioning of the fragments (by reducing the rmsd_thr) would 
produce conformations that are structurally closer to the native state, at the price 
of an increase in the size of the search space. 





Fig. 6. Protein lENH: native state (left), prediction — green/light gray — compared 
to original — red/dark grey — (center), prediction (right) 



In Figure m we depict the native state (on the left) and our best prediction (on 
the right) for the protein lENH with the backbone and centroids. In the middle, 
we show the RMSD overlap of the Ca atoms between the native conformation 
(red/dark gray) and the prediction (green/light gray). The main features are pre- 
served and only the right loop that connects the two helices appears to have moved 
significantly. This could be avoided by introducing a richer set of alternative frag- 
ments in that area and thus allow a more relaxed placement of the fragments. 

An important issue is that it is not obvious that the reduced representation and 
the energy function used here are able to distinguish the native structure from 
decoys constructed to minimize that energy function. The straightforward compar- 
ison between RMSD and energy is not completely meaningful because the native 
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structure should be energy-minimized before energy comparison. This can be noted 
by comparing the conformations obtained by enumeration and LNS. The model in 
use improves w.r.t. the Ca model as presented in (jDal Palri et al. 2004p . However, 
some fine tuning of the energy coefficients is still necessary in order to improve the 
correlation, while preserving the overall constraint model. This will be a further 
area of investigation. Our results show that the method can scale well and that 
further speed-up may be obtained by considering larger fragments as done by tools 
like Rosetta. Rosetta is in fact the state-of-the-art predictor tool (e.g., the small 
protein lENH is predicted by Rosetta in less than one minute with a RMSD of 4.2 

A). 

6 Conclusions and future work 

In this paper we presented the design and implementation of a constraint logic 
programming tool to predict the native conformation of a protein, given its primary 
structure. The methodology is based on a process of fragments assembly, using 
templates retrieved from a protein database, that is clustered according to shape 
similarity. We used templates based on sequences of 4 amino acids. The constraint 
solving process takes advantage of a large neighboring search strategy, which has 
been shown to be particularly effective when dealing with longer proteins. 

The preliminary experimental results confirm the strong potential for this frag- 
ment assembly scheme. The proposed method has a significant advantage over 
schemes like Rosetta — the use of CLP{FV) enables the simple addition of ad- 
hoc constraints and experimentation with different local search moves and energy 
functions. The implementation presented here constitutes a proof of principle. 

In order to make this a useful tool in a realistic prediction scenario several im- 
provements must be implemented. The choice of 4-residue fragment will be im- 
proved in the next future in two directions: fragments will be chosen based on 
sequence or profile alignment (rather than exact match) against a non-redundant 
representative set of sequences whose structure is known; the size of the fragment 
will be chosen based on the alignment and will not be restricted to 4-residues. 

The reduced representation used here should be replaced by an all-atom repre- 
sentation at least for the backbone atoms whose hydrogen bonds define the proper 
relative arrangement of beta-strands not belonging to the same fragment. In gen- 
eral we plan to test different energy functions that better correlate with RMSD 
w.r.t. the (known) native structures and the computed ones. It is likely that with 
sequences longer than those considered here predictions will not be equally good 
in all parts of the molecule, therefore alternative measurements of similarity like 
GDT-TS (jZemla 2003P might be more appropriate. 

Other (possibly redundant) constraints and constraint propagation techniques 
should be analyzed, including the migration to the C-|--|- solver Gecode. 
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